The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the Detection and Lemmatization of Unknown Words
نویسندگان
چکیده
Unknown words, or out of vocabulary words (OOV), cause a significant problem to morphological analysers, syntactic parses, MT systems and other NLP applications. Unknown words make up 29 % of the word types in in a large Arabic corpus used in this study. With today's corpus sizes exceeding 10 9 words, it becomes impossible to manually check corpora for new words to be included in a lexicon. We develop a finite-state morphological guesser and integrate it with a machine-learning-based pre-annotation tool in a pipeline architecture for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexical database. The processing is performed on a corpus of contemporary Arabic of 1,089,111,204 words. Our method is tested on a manually-annotated gold standard and yields encouraging results despite the complexity of the task. Our work shows the usability of a highly non-deterministic morphological guesser in a practical and complex application.
منابع مشابه
Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization
This paper deals with the automatic construction of a lemmatizer from a Full Form Lemma (FFL) training dictionary and with lemmatization of new, in the FFL dictionary unseen, i.e. out-ofvocabulary (OOV) words. Three methods of lemmatization of three kinds of OOV words (missing full forms, unknown words, and compound words) are introduced. These methods were tested on Czech test data. The best r...
متن کاملAutomatic Construction of Persian ICT WordNet using Princeton WordNet
WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...
متن کاملHandling Unknown Words in Arabic FST Morphology
A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inc...
متن کاملA New Similarity Measure for Automatic Construction of the Unknown Word Lexical Dictionary
This paper deals with research that automatically constructs a lexical dictionary of unknown words as an automatic lexical dictionary expansion. The lexical dictionary has been usefully applied to various fields for semantic information processing. It has limitations in which it only processes terms defined in the dictionary. Under this circumstance, the concept of “Unknown Word (UW)” is define...
متن کاملIranian EFL Learners’ Lexical Inferencing Strategies at Both Text and Sentence levels
Lexical inferencing is one of the most important strategies in vocabulary learning and it plays an important role in dealing with unknown words in a text. In this regard, the aim of this study was to determine the lexical inferencing strategies used by Iranian EFL learners when they encounter unknown words at both text and sentence levels. To this end, forty lower intermediate students were div...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012